Abstract

Top-down parsing has received much attention recently. Parsing expression grammars
(PEG) allow construction of linear time parsers using the packrat algorithm. These techniques,
however, suffer from the problem of prefix hiding. We use an alternative formalism of relativized
regular expressions REG^REG for which a top-down backtracking parser runs in linear time.
This formalism allows us to construct fast parsers with modest memory requirements for
practical grammars. We show that our formalism is equivalent to PEG.

1 Introduction

A top-down parsing implementation can be viewed as a bunch of mutually recursive functions
recognizing individual rules in the grammar description. A naively implemented parser of the
following rule:

R = "aa" R | "a" R

on "aaaaaaaaaaaaaaaaaaa..." can take exponential time.
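The blow-up can be observed with a small experiment. The following Python sketch hand-codes a
naive backtracking recognizer for R and counts how often it is entered on a string of n copies
of 'a'. The helper name count_calls is ours and not part of the formalism. Since R has no
non-recursive alternative, every attempt fails, and all ways of splitting the input into "aa"
and "a" pieces are explored.

```python
def count_calls(n: int) -> int:
    """Count invocations of R = "aa" R | "a" R on the input "a" * n."""
    text = "a" * n
    calls = 0

    def R(pos: int) -> bool:
        nonlocal calls
        calls += 1
        # First alternative: "aa" R
        if text.startswith("aa", pos) and R(pos + 2):
            return True
        # Second alternative: "a" R
        if text.startswith("a", pos) and R(pos + 1):
            return True
        return False  # R has no base case, so every path eventually fails

    R(0)
    return calls


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, count_calls(n))
```

The counts satisfy calls(n) = 1 + calls(n-1) + calls(n-2), so they grow like the Fibonacci
numbers, that is, exponentially in n.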

Incorporating left recursion also causes problems. A naive parser of

L = L a

would call L infinitely many times. Various approaches have been suggested to solve both problems.

In natural language processing we typically want to enumerate all possible interpretations of
an ambiguous grammar.

Frost [9] gave an O(n^4) algorithm that outputs a compact representation of all parses [21] and
handles left recursion as recursive descent. Parsing expression grammars allow unlimited
lookahead. Okhotin [17] suggests extending context-free grammars with lookahead to the class of
boolean grammars. Again, his algorithm for boolean grammars has complexity O(n^4). Both these
algorithms were improved by a variant of Valiant's algorithm [23] to obtain complexity
O(M(n) log n), where M(n) is the time of matrix multiplication. When boolean grammars are
restricted to unambiguous boolean grammars there exists an O(n^2) algorithm.

For programming languages ambiguity is undesirable. One approach is parsing expression
grammars, defined by Ford [5]. A parsing expression grammar (PEG for short) can
be viewed as a top-down parser with three additional constraints. The first is that rules are
deterministic. The second is that the choice operator | is restricted to the ordered choice
operator /: once an alternative of an ordered choice succeeds, the choice succeeds and we do not
backtrack if something fails later. The third is that iteration is greedy and does not backtrack.
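The second and third constraints can be made concrete with a minimal Python sketch. The
combinator names (lit, seq, choice, star, matches) are hypothetical and chosen for this
illustration only. Each parser returns the end position of its single committed match, or None;
once choice or star has returned, there is nothing left to backtrack into.

```python
def lit(s):
    # Match the literal string s at pos.
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def seq(*parsers):
    def p(text, pos):
        for q in parsers:
            pos = q(text, pos)
            if pos is None:
                return None
        return pos
    return p

def choice(*parsers):
    # Ordered choice: commit to the first alternative that succeeds.
    def p(text, pos):
        for q in parsers:
            r = q(text, pos)
            if r is not None:
                return r
        return None
    return p

def star(q):
    # Greedy iteration: consume as much as possible, never give anything back.
    def p(text, pos):
        while True:
            r = q(text, pos)
            if r is None or r == pos:
                return pos
            pos = r
    return p

def matches(p, text):
    # True when p consumes the whole input.
    return p(text, 0) == len(text)

if __name__ == "__main__":
    print(matches(choice(lit("a"), lit("ab")), "ab"))
    print(matches(choice(lit("ab"), lit("a")), "ab"))
    print(matches(seq(star(lit("_")), lit("_foo")), "_foo"))
```

In the first and third checks the whole input is not matched, because the first alternative, or
the greedy loop, commits before the rest of the input is seen.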

This definition without backtracking introduces the problem of prefix hiding: the expression
"a"/"ab" does not match the string "ab".

Seaton in his Katahdin language [20] uses a different, longest choice operator to partially solve
this problem. A longest choice tries all alternatives and deterministically chooses the longest
match. However, this does not eliminate prefix hiding completely. A parser of
"_"* "_foo" (* is the iteration operator) still does not match "_foo".

We take another approach. Programming languages use only two types of recursion: iteration
and nested recursion. By making this information explicit we can generate linear time
parsers that are equivalent to the fully backtracking ones.

We present a new formalism of relativized regular expressions REG^REG. Our formalism relaxes
the determinism of PEG grammars. As in PEG, we support arbitrary lookaheads. Previous
results can be easily derived using the REG^REG formalism.

Although REG^REG seems stronger than PEG, we show that PEG and REG^REG are equivalent.

The author's Amethyst language implements a REG^REG parser with several extensions. One
is that Amethyst allows parametrized rules like times(4, 'foo') and supports lambdas like
times(3, (| a|b|c |)). The Amethyst language is described in the author's thesis [2].

2 Structured grammars 

We devise an approach to describing programming languages which we call structured grammars.
We build on an analogy with structured programming languages.

Just as programs once used arbitrary goto constructs, grammars use arbitrary forms of recursion.
To make programs more readable, programming languages were extended with structured
control flow constructs, making it easier for developers to read the code on a local basis without
spending hours understanding the whole context. We seek similar goals with the introduction of
structured grammars.

Assume we are given a grammar for the fully backtracking top-down parser. We say it is
a structured grammar if it satisfies the following conditions:

1. Transparency of semantic actions. We can imagine that the parser is augmented by an oracle
that may decide that an alternative will eventually fail. The parser should produce the same
output regardless of whether we tried that alternative and failed or used the hint from the oracle.
Lookaheads form an important case: we always revert actions made by lookaheads.

2. Recursion is restricted to iterative and nested recursion.
Iterative:

For example, arguments of a function in C are lists of expressions separated by ",". We
typically use the iteration operator *. Iteration can also be described by left-recursive or
right-recursive rules. When possible, iteration should be written in a way that is associative.
Nested recursion:

What is not iteration can be described by start and end delimiters. We require the user to
annotate this concept with the operator nested(start, middle, end).

The simplest example is properly parenthesized expressions. They can be described as:
exp = nested('(', (| exp |), ')')

We show two less trivial examples in Amethyst language syntax. A while loop in C is
matched by:

'while' exp nested('{', (| stmts |), '}')

Python uses indentation to describe nesting. We use a semantic predicate to find where
the block ends. We match Python while loops as:
nested((| '\n' ' '*:x 'while' exp |), (| stmts |),
       (| &('\n' ' '*:y &{x.size>y.size}) |))

A nesting should satisfy three natural conditions.

2.1. The position of the end delimiter is determined by the position of the start delimiter.

2.2. When a nested starts at a smaller position it should end at a strictly larger position.

2.3. When both nested(start, mid1, end) and nested(start, mid2, end) match a string
then their end positions should agree.

Note that programming languages implicitly follow this convention. Other types of recursion
are undesirable because the user cannot reason about them locally.

One of the reasons is that programming languages were described as deterministic context-free
grammars. Thus they can be recognized by a deterministic push-down automaton. We can model
a push/pop pair by calling nested. Indeed, if we did not include lookaheads our class would be
equivalent to the class of deterministic context-free grammars. We leave the proof as an exercise.

Structured grammars offer additional advantages. For example, we can use the structure
information to semiautomatically construct an error correction tool.

For equivalence with the top-down parser our parsing algorithm needs condition 2.1. Without
condition 2.2 a parser would run in quadratic instead of linear time. Condition 2.3 is a design
guideline which is not needed by our algorithm.

3 PEG and REG^REG operators

Parsing expression grammars [5] are defined by the following operators.



's'          Match string.
r            Rule application.
e1 e2        Sequencing.
e1/e2        Ordered choice.
e* e+        Iteration.
&e ~e        Positive and negative lookahead.
{a} &{a}     Semantic action and predicate.



We relax the determinism of PEG to obtain REG^REG expressions. We can describe every structured
grammar by REG^REG rules with a linear time guarantee.

REG^REG expressions mostly use the same operators as PEG. The difference is that the operators
backtrack, except for nested, which behaves deterministically.

nested(start, mid, end)   Nested operator.
e1|e2                     Prioritized choice.
e* e+                     Backtracking iteration.
e1[e2]                    Enter operator.

The enter operator is described in section 6.



3.1 Simple algorithm 

We will describe our parser in functional programming style pseudocode. We denote a lambda
as

\(arguments){body}

and call it with the call method. The rest of the code is self-explanatory.

We start with a simple implementation and will progressively add more details.

A REG^REG parser behaves mostly like a top-down parser. We use a function match(e, s, cont)
where e is the expression we match, s is the current position and cont is the continuation,
represented as a lambda.

match(r   ,s,cont) = match(body(r),s,cont)

match('c' ,s,cont) = if s.head=='c' ; cont.call(s.tail)
                     else           ; fail

match(e f ,s,cont) = match(e,s,\(s2){ match(f,s2,cont) })

match(e|f ,s,cont) = if match(e,s,cont) ; success
                     else               ; match(f,s,cont)

match(~e  ,s,cont) = if match(e,s,\(s2){success}) ; fail
                     else                         ; cont.call(s)

match(e*  ,s,cont) =
  cont2 <- \(s2){ if match(e,s2,cont2) ; success
                  else                 ; cont.call(s2)
                }
  cont2.call(s)

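The pseudocode above describes a naive top-down parser. For the REG^REG class we restrict
recursion and add the nested operator:

match(nested(st,mi,en),s,cont) =
  s3 <- match((st mi en),s,\(s2){success})
  if s3 ; cont.call(s3)
  else  ; fail

Here s3 stands for the end position of the first successful match of st mi en; once it is found,
later alternatives inside nested are never retried.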
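The rules above translate almost line by line into a runnable Python sketch. The class names
(Empty, Char, Ref, Seq, Alt, Not, Star, Nested) and the helper parse are assumptions of this
sketch, positions are plain integer indices, and iteration bodies are assumed to consume at
least one character. It is meant as an illustration of the continuation passing scheme, not as
the Amethyst implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class Empty: pass
@dataclass(frozen=True)
class Char: c: str
@dataclass(frozen=True)
class Ref: name: str
@dataclass(frozen=True)
class Seq: e: object; f: object
@dataclass(frozen=True)
class Alt: e: object; f: object
@dataclass(frozen=True)
class Not: e: object
@dataclass(frozen=True)
class Star: e: object
@dataclass(frozen=True)
class Nested: st: object; mi: object; en: object

Cont = Callable[[int], bool]

def match(e, text: str, s: int, cont: Cont, rules: Dict[str, object]) -> bool:
    if isinstance(e, Empty):
        return cont(s)
    if isinstance(e, Ref):                      # rule application
        return match(rules[e.name], text, s, cont, rules)
    if isinstance(e, Char):
        return s < len(text) and text[s] == e.c and cont(s + 1)
    if isinstance(e, Seq):
        return match(e.e, text, s,
                     lambda s2: match(e.f, text, s2, cont, rules), rules)
    if isinstance(e, Alt):                      # prioritized, backtracking choice
        return match(e.e, text, s, cont, rules) or match(e.f, text, s, cont, rules)
    if isinstance(e, Not):                      # negative lookahead
        if match(e.e, text, s, lambda s2: True, rules):
            return False
        return cont(s)
    if isinstance(e, Star):                     # backtracking iteration
        def cont2(s2: int) -> bool:
            # Try one more iteration first, then fall back to the continuation.
            return match(e.e, text, s2, cont2, rules) or cont(s2)
        return cont2(s)
    if isinstance(e, Nested):                   # commit to the first end position
        ends = []
        def first(s2: int) -> bool:
            ends.append(s2)
            return True
        if match(Seq(e.st, Seq(e.mi, e.en)), text, s, first, rules):
            return cont(ends[0])
        return False
    raise TypeError(f"unknown expression: {e!r}")

def parse(e, text: str, rules=None) -> bool:
    # Succeeds only if the whole input is consumed.
    return bool(match(e, text, 0, lambda s: s == len(text), rules or {}))

if __name__ == "__main__":
    a, b = Char("a"), Char("b")
    print(parse(Alt(a, Seq(a, b)), "ab"))                            # backtracking choice
    print(parse(Nested(Empty(), Alt(a, Seq(a, b)), Empty()), "ab"))  # nested commits
    rules = {"exp": Nested(Char("("), Star(Ref("exp")), Char(")"))}
    print(parse(Ref("exp"), "(()())", rules))
```

Wrapping a choice in nested with empty delimiters gives the committed, PEG-like behaviour, while
the bare choice backtracks into its second alternative.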

3.2 Equivalence with top-down parsers and PEG 

We prove that for structured grammars the REG^REG parser finds the same derivation as the fully
backtracking one. As a top-down parser does not directly support left recursion, we do not
consider left recursion in this section.

An implementation of the fully backtracking parser is the same as the implementation of the
REG^REG parser in section 3.1 except for nested:

match(nested(st,mi,en),s,cont) = match((st mi en),s,cont)

For the sake of the proof we rewrite the implementation of nested in the REG^REG parser into an
equivalent one. In nested we only consider the first alternative, as the following pseudocode
suggests:

match(nested(st,mi,en),s,cont) =
  first <- true
  match((st mi en),s,\(s2){
    if first ; first <- false
             ; cont.call(s2)
    else     ; fail
  })

An equivalence with the top-down parser can be proved by an easy induction on the nesting level.
1. When the expression contains no nesting we have an identical implementation.
2. Assume we have proved the proposition for nesting level l-1. We prove level l by a second
induction on the number of nested calls in the continuation on level l-1.

(a) For a continuation that does not call nested we use the same argument as in 1.

(b) Assume we have a continuation that calls nested n times. Consider the first time we call
nested. If this call fails then, by induction, it also fails in the fully backtracking parser
and we are done.

Otherwise both the REG^REG and the fully backtracking parser first try the lexicographically
smallest alternative in the recursion tree. If the continuation succeeds, the derivation
is the same by induction.

If the continuation fails we use assumption 2.1 of structured grammars. Our parser
does not try further alternatives. A backtracking parser enumerates all derivations,
but every derivation ends in the same position, so the continuation will always fail. Thus
the backtracking parser behaves like the REG^REG parser.

Just as not every C program is a structured program, not every REG^REG grammar is a structured
one. We can use nested with empty start and end to implement PEG operators. This gives
us the inclusion PEG ⊆ REG^REG. The opposite inclusion is true but not very enlightening: as
there are only finitely many pairs (e, cont), we can for each pair write a PEG rule that emulates
the REG^REG algorithm.

For the linear time guarantee we still require every recursion except left and right recursion to
be annotated by nested.

4 Relativized regular machines

To better understand the languages recognized by relativized regular expressions we introduce
relativized regular machines, which are similar to nondeterministic finite state machines [18].
We use this formalism as an inspiration for an effective low-level implementation of parsers.

It is easy to see that a continuation corresponds to a right congruence class. We use a
representation that unifies identical expressions and continuations. This can be viewed as NFA
state minimization^1.

A relativized regular machine is similar to a nondeterministic finite state machine. A machine
can be described by a triple M = (S, t, a) where
S is a set of states,
t : (S, N, S) -> (M, S) is a set of transitions and
a ⊆ S is a set of accepting states.
We have elementary machines that match a single character.

Transitions from state s are done in the following way. We put (M_i, s_i) = t((s, i, f_{i-1})), then
recursively call the machine M_i, and if it succeeds we move to its end position and set the state
to s_i. Based on the accepting state this choice reaches we choose the next choice.

5 Effective implementation 

The implementation above runs in linear time but the constant factor is quite high. For a better
constant factor our parser generator applies various optimizations. We use a low-level
representation that is suitable for these optimizations.

^1 NFA minimization is NP-hard in the general case. Our approach is a good heuristic.

In this section we describe a parser that does not consider semantic actions. Semantic actions
will be added in the next section.

The representation of expressions is similar to a syntax tree but more compact. We use a similar
technique to the compact representation of derivations in the Tomita algorithm [21]:

1. All nodes are immutable.

2. We represent all identical subtrees by a single object. When we are asked to construct a
node, the optimizer first tries to simplify the node by algebraic identities. If after
simplification we obtain a node identical to a previously constructed node, we return the
previously constructed node.
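The two points above amount to what is often called hash-consing. A minimal Python sketch,
assuming a hypothetical constructor mk and two illustrative identities (dropping an empty
operand of a sequence, and collapsing a choice between identical alternatives) that stand in for
whatever identities a real optimizer applies:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True, eq=False)   # frozen: immutable; eq=False: compare and hash by identity
class Node:
    kind: str
    args: Tuple[object, ...]

_nodes: Dict[tuple, Node] = {}      # table of all nodes constructed so far

def mk(kind: str, *args) -> Node:
    # Step 1: simplify by algebraic identities (illustrative choices, see lead-in).
    if kind == "seq":
        head, tail = args
        if head.kind == "empty":
            return tail
        if tail.kind == "empty":
            return head
    if kind == "alt" and args[0] is args[1]:
        return args[0]
    # Step 2: return the previously constructed node if an identical one exists.
    # Children are already shared, so comparing them by identity is sufficient.
    key = (kind,) + args
    node = _nodes.get(key)
    if node is None:
        node = _nodes[key] = Node(kind, args)
    return node

if __name__ == "__main__":
    a, b = mk("char", "a"), mk("char", "b")
    print(mk("seq", a, b) is mk("seq", a, b))
    print(mk("seq", mk("empty"), a) is a)
```

Because equal subtrees are the same object, they can serve directly as keys of a memoization
table.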

We will again use a function match(e, Args[...]) -> Result[...].
We will extend several times what the Args and Result objects contain. Initially we define
the following fields:

Args.s is the starting position in the string,
Result.s is the end position in the string,
Args.cont is a continuation.

The objects Args and Result have a method change that creates a new object with appropriately
changed fields.

5.1 Sequencing 

We represent the sequencing operator head tail by an object with pattern Seq[head tail].
Representing sequencing in this way allows tail parts to be shared. The implementation is
straightforward.

match(Seq[head tail], a) = match(head, a.change(
  cont: \(a2){
    match(tail, a2.change(cont: a.cont))
  }))

5.2 Choice and lookaheads 

Inspired by relativized regular machines we model choice and lookaheads by a more general
Switch operator. First we need to add the field Result.state. This state will be used to pass
information from rules to the Switch operator.

A Switch operator satisfies the pattern Switch[head alt:{state=>tail} merge].
The Switch operator first matches head. Then it looks at which end state head reached and
matches the tail entry corresponding to that state. Finally it computes the final state from the
states of its children by the merge method.

For simplicity, in this paper we use only two states, success and fail. We use the identity
function as the merge method. We also add success and fail rules with the obvious implementation:

match(Rule[success]) = Result[state: success]
match(Rule[fail])    = Result[state: fail]

This is quite a general operator and we illustrate its uses with several examples.

A choice operator backtracks until the success state is reached. An implementation is:

e1|e2 -> Switch[ e1 {success: success,
                     fail:    e2} ]

Lookaheads can be modeled like:

~e -> Switch[ Seq[e success]
              {success: fail,
               fail:    empty} ]

&e -> Switch[ Seq[e success]
              {success: empty,
               fail:    fail} ]

A Switch makes optimizations easy.

Switches can be easily composed. To compose switches A and B the simplest way is to use
states that are pairs (state from A, state from B). We need to define the merge method to compute
the final state. We can represent these pairs compactly as a bit vector.

Another optimization is predication. When we know the first character we can simplify the
expression:

Switch[ Result[first_character]
        { 'a': expressions that can start with a,
          'b': expressions that can start with b,
          ... } ]

For a choice e1|e2 we can, based on the result of the partial match of e1, simplify the matching
of e2. For example, consider the expression:

  (a|b) c (d|f)
| (b|c) c f

on the string "bcd".

When the first alternative matches "d" we know that the second alternative will not match. The
last choice could pass a state to inform the first choice about this condition.

An implementation of Switch is the following. We hide technical details in the merge method.
For details see our implementation [B].

match_memo(e,a) =
  if memo[e,a] ; memo[e,a]
  else         ; memo[e,a] <- match(e,a)

match(Switch[head alt merge],a) =
  r  <- match_memo(head,a)
  r2 <- match_memo(alt[r.state],a)
  merge(r,r2)

5.3 Iteration 

We use a low-level repeat-until operator to represent iteration.

e**     Many[stop e]   repeat-until
Stop    Stop[stop]     stop operator

Repeat-until can terminate if and only if we encountered the corresponding Stop in the current
iteration. We add a stops field to Args to collect encountered stops.

This allows us to describe normal iteration e* and eager iteration e*? as (e | Stop)** and
(Stop | e)** respectively. Repeat-until is equivalent to right recursion. For example, we can flip
between the rules

R = a R | b | c R | d

and

R = (a | b Stop | c | d Stop)**.

Except for the stop condition the implementation is nearly identical to the implementation of
the * operator from section 3.1.

match(Stop[st],a)   = a.cont.call(a.change(stops: a.stops+st))
match(Many[st e],a) =
  cont2 <- \(a2){
    if a2.stops & st ; a.cont.call(a2.change(stops: a2.stops-st))
    else             ; match(e,a2.change(cont: cont2))
  }
  cont2.call(a)
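The interplay of Many and Stop can be tried out with a small Python sketch. The combinator
names (char, alt, seq, stop, many, star, run) and the representation of stops as a frozenset of
labels are assumptions of this sketch; dataclasses.replace plays the role of the change method.
Each loop is assumed to use its own label.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class Args:
    text: str
    s: int                               # current position
    cont: Callable[["Args"], bool]       # continuation
    stops: FrozenSet[str] = frozenset()  # stops encountered so far

def char(c):
    return lambda a: (a.s < len(a.text) and a.text[a.s] == c
                      and a.cont(replace(a, s=a.s + 1)))

def alt(e, f):
    return lambda a: e(a) or f(a)

def seq(e, f):
    return lambda a: e(replace(a, cont=lambda a2: f(replace(a2, cont=a.cont))))

def stop(st):
    # Record the stop label and continue.
    return lambda a: a.cont(replace(a, stops=a.stops | {st}))

def many(st, e):
    # Repeat e until the current iteration has passed through Stop[st].
    def m(a):
        def cont2(a2):
            if st in a2.stops:
                return a.cont(replace(a2, stops=a2.stops - {st}, cont=a.cont))
            return e(replace(a2, cont=cont2))
        return cont2(a)
    return m

def star(e, label="star"):
    # e* expressed as (e | Stop)**
    return many(label, alt(e, stop(label)))

def run(p, text):
    return bool(p(Args(text, 0, lambda a: a.s == len(a.text))))

if __name__ == "__main__":
    # R = (a | b Stop | c | d Stop)**, the loop form of R = a R | b | c R | d
    R = many("r", alt(char("a"), alt(seq(char("b"), stop("r")),
                      alt(char("c"), seq(char("d"), stop("r"))))))
    print(run(R, "acab"), run(R, "aca"))
    print(run(seq(star(char("_")), seq(char("_"), char("f"))), "__f"))
```

The last line exercises backtracking: the loop first consumes both underscores, fails in the
continuation, and then gives one back.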

5.4 Rule call 

A rule call only affects the scope of variables. When no semantic actions are present we can
directly move an expression to a separate rule and back.

match(Rule[e], a) = match(e,a)

For nested we use a similar implementation as before.

match(Nested[st mi en], a) =
  r <- match_memo(Seq[st mi en], Args[s: a.s, cont: \(m){success}])
  if r.state==success ; a.cont.call(a.change(s: r.s))
  else                ; fail

6 Semantic actions 

The parser from section 5 is not very useful as it can only answer yes/no questions. When we
integrate a parser generator into a programming language, called the host language, we can
specify host language expressions called semantic actions. For example, we can use semantic
actions for a simple calculator:

add = mul:x '+' add:y {x+y}
    | mul

mul = number:x '*' mul:y {x*y}
    | number

While semantic actions are easy to add they complicate other parts of the parser.

We add the following fields:
Args.closure     closure for semantic actions.
Args.returned    result of the last expression.
Result.returned  returned result.

We model a semantic act as a function that modifies arguments. For simplicity we model
variable binding by a semantic act.

match(Act[f], a) = a2 <- f.call(a.closure)
                   a.cont.call(a.change(a2))

Now we are ready to add the enter operator.

match(Enter[e1 e2], a) =
  match(e1, a.change(cont: \(a2){
    match(e2, a2.change(s: a2.returned, cont: \(a3){
      a.cont.call(a3.change(s: a2.s))
    }))
  }))
Semantic actions in a rule invocation have a shared scope. We use a closure object to achieve
this. A rule invocation becomes:

match(Rule[e], a) = match(e,
  a.change(closure: new_closure,
           cont: \(a2){ a.cont.call(a.change(s: a2.s,
                                             returned: a2.returned)) }))

We can use a host language expression called a semantic predicate to decide whether an
expression matched or not. This complicates memoization and, for simplicity, we disable
memoization when a semantic predicate is present.

In Amethyst we also support parametrized rules and lambdas. They are a bit technical to add.
For a parametrized rule we first model arguments by a semantic act bound to argument variables.
Then we add a field consisting of pairs (argument variable, parameter variable) and we initialize
the new closure according to the pairs. For a lambda we bind an (expression, closure) pair to the
corresponding variable. We disable memoization when a parametrized rule is present for the same
reasons as with a semantic predicate.

Memoization becomes more technical. The simplest way to get linear time complexity is
to use a two-pass parser which in the first pass runs the parser from section 5 and in the second
pass just constructs the parse tree. We refine this idea and run both phases in parallel. We use
the functor forget_semantic_actions:

match_memo_state(e,a) =
  if (has_predicate(e) | has_predicate(a)) ; match(e,a)
  else ; e2 <- forget_semantic_actions(e)
         a2 <- forget_semantic_actions(a)
         if memo[e2,a2] ; memo[e2,a2]
         else           ; memo[e2,a2] <- match(e,a)

A simple implementation of Switch can be

match(Switch[head alts merge], a) =
  r  <- match_memo_state(head,a)
  r2 <- match_memo_state(alts[r.state],a)
  if r2.state==fail ; fail(r2)
  else              ; merge(match(head,a),
                            match(alts[r.state],a))

Sometimes Switch knows that the result is not needed. Then we can directly call the expression
simplified by forget_semantic_actions. This always happens for lookaheads.

6.1 Time complexity

Ford [5] rewrites iteration to recursion for linear time complexity. However, most
implementations naively use a loop.

It is possible to construct test cases where an arbitrary number (say k) of loops are nested
and each fails at the end of input. This leads to time complexity at least n^k for arbitrary
k. This can be seen on the following expression:

( ( ( ( 'a' )* 'b'
      / 'a' )* 'c'
    / 'a' )* 'd'
  / 'a' )* 'e'

on "aaaaaaaaaaaaaaaaaaaaaaa...".
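The growth can be measured with a short Python sketch that evaluates the first k levels of this
expression with plain loops and no memoization, counting character comparisons. The function
name steps and the counting convention (one unit per attempted literal) are assumptions of this
sketch.

```python
TERMINATORS = "bcde"  # closing literal of nesting levels 1..4

def steps(k: int, n: int) -> int:
    """Character comparisons made by a naive loop-based PEG parser for the
    first k nesting levels of the expression above on the input 'a' * n."""
    text = "a" * n
    count = 0

    def lit(c: str, pos: int):
        nonlocal count
        count += 1
        return pos + 1 if pos < n and text[pos] == c else None

    def level(j: int, pos: int):
        # Level 1 is ('a')* 'b'; level j > 1 is (level(j-1) / 'a')* terminator.
        while True:
            r = level(j - 1, pos) if j > 1 else None
            if r is None:
                r = lit("a", pos)          # fall back to the 'a' alternative
            if r is None or r == pos:
                break
            pos = r
        return lit(TERMINATORS[j - 1], pos)

    level(k, 0)
    return count

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, steps(k, 20), steps(k, 40))
```

For k = 1 the count is n + 2, and every further level re-runs the level below it at each
position, which multiplies the cost by roughly another factor of n.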

We memoize continuations precisely for this reason. 

For the parser from section 5 there are only finitely many expressions and continuations. Thus
there are only O(n) memoization pairs (e,a).

With semantic actions we sometimes need to recalculate the result. For a given pair
(nested, position) we need to recalculate the result of every (e,a) pair at most once. For general
REG^REG expressions the time complexity O(n^2) follows.

For structured grammars this behavior cannot happen. We do not have to recalculate
when the result state is fail or when we match in lookaheads. What is left is that we could have
two invocations of the same nested expression at two different positions that recalculate the
same (e,a) pair. But this would mean that both invocations are accepted with the same end
position, which contradicts condition 2.2 of structured grammars. Consequently the parser
of structured grammars runs in linear time.

With semantic predicates we cannot give any complexity guarantee. To integrate them
correctly we disable memoization when the continuation contains a semantic predicate.

7 Memory consumption of REG^REG parsers

Mizushima et al. [15] propose a way to decrease memory usage. We describe a similar but
simpler approach.

The parser implementation maintains the set of live branches in a list live. The list is
maintained in the following way:

• When the parser descends into a choice operator, its branches are added to the live list.

• When the parser descends into a branch, that branch is removed from the live list.

• When the parser encounters a cut, the branches that were cut are removed from the live list.

When the live list is empty we know that subsequent parsing cannot return to a position smaller
than the current one. We can safely delete all memo entries with a smaller position.

One can observe that the live list is not really needed. The implementation can be further
simplified by only keeping track of the size of the list in a counter alternatives.

The parser then deletes stale entries from the memo table lazily. It keeps track of the rightmost
position where alternatives was zero. At the time a table expansion is needed, all earlier entries
are deleted. This avoids the need for the expansion if the table after deletion is at most half
full.

Note that if we want to incorporate destructive semantic actions we can in the same way defer
their evaluation until alternatives is zero.

For practical grammars this extension gives nearly constant memory usage. However, we
can construct examples where this approach does not help. For example, in the expression

exp* 'x' | exp* 'y'

we need to keep memoization entries until the end is reached.



8 From REG^REG back to REG

We establish a reg functor. This notion gives a unified way to analyze REG^REG expressions.

The reg functor assigns to each relativized regular expression e a regular expression reg(e).
reg(e) satisfies the approximation condition that if e accepts s then reg(e) accepts s, but the
converse is not necessarily true.

We can extract useful information by testing whether the intersection with a suitable regular
language is empty.

empty(e)           = reg(e) ∩ reg('')
first_char('c', e) = reg(e) ∩ reg('c' .*)
overlap(e1,e2)     = reg(e1 .*) ∩ reg(e2 .*)

If overlap(e1,e2) does not match anything then we can freely flip between e1|e2 and
e2|e1. Also note that in this case the choice is deterministic and we do not have to backtrack
if the first alternative matches.

Mizushima [15] also transforms the grammar into a more deterministic one. We use a stronger
analysis. Using overlap we can determine where we can insert return states that inform Switch
that the next alternatives cannot occur. This transformation is quite technical and beyond the
scope of this paper.

While bounds minsize(e), maxsize(e) on the minimal and maximal sizes of a string that matches
e can be discovered by intersecting with suitable languages, it is faster to compute them directly
by depth-first search.
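As an illustration, a lower bound in the spirit of minsize can be computed directly on the
expression tree. The Python sketch below encodes expressions as tuples (a representation assumed
here for brevity) and, instead of a single depth-first pass, iterates to a fixed point so that
mutually recursive rules are handled; that substitution is ours.

```python
import math
from typing import Dict

def minsize(e: tuple, rule_min: Dict[str, float]) -> float:
    """Lower bound on the length of any string matched by e."""
    kind = e[0]
    if kind == "char":
        return 1
    if kind == "seq":
        return minsize(e[1], rule_min) + minsize(e[2], rule_min)
    if kind == "alt":
        return min(minsize(e[1], rule_min), minsize(e[2], rule_min))
    if kind in ("star", "and", "not"):      # may match without consuming input
        return 0
    if kind == "nested":                    # start, middle and end in sequence
        return sum(minsize(x, rule_min) for x in e[1:])
    if kind == "ref":
        return rule_min[e[1]]
    raise ValueError(f"unknown expression kind: {kind}")

def rule_minsizes(rules: Dict[str, tuple]) -> Dict[str, float]:
    # Start from infinity ("no match known yet") and lower the bounds until stable.
    best = {name: math.inf for name in rules}
    changed = True
    while changed:
        changed = False
        for name, body in rules.items():
            m = minsize(body, best)
            if m < best[name]:
                best[name] = m
                changed = True
    return best

if __name__ == "__main__":
    rules = {
        # L = L 'a' | 'b'
        "L": ("alt", ("seq", ("ref", "L"), ("char", "a")), ("char", "b")),
        # exp = nested('(', exp*, ')')
        "exp": ("nested", ("char", "("), ("star", ("ref", "exp")), ("char", ")")),
    }
    print(rule_minsizes(rules))
```

A rule whose bound stays infinite cannot match any finite string.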

The functor reg can be defined in the following way:

reg( 'c' )                   = c
reg( r )                     = reg(body(r))
reg( a* )                    = reg(a)*
reg( nested(start,mid,end) ) = reg(start) .* reg(end)
reg( a b )                   = reg(a) reg(b)
reg( a|b )                   = reg(a) | reg(b)
reg( &a b )                  = reg(a) ∩ reg(b)

We use a rough approximation of the middle of nested. In the typical case the inside of a nesting
could be practically anything, so trying to improve this approximation leads only to larger
expressions without any new insights.

We remark that a better result can be obtained by first using a relativized regular machine
and then converting to a regular machine. This gives two advantages.

The first is that Switch also describes lookaheads and we can describe intersection by lookahead.

The second is that we can use the facts:
If A is unambiguous then A B ∩ A C = A (B ∩ C).
If A is unambiguous then A B | A C = A (B | C).

^2 We can also consider continuations for better results.

As there are only finitely many (continuation, cuts, stops) triples, the size of our machine is
finite.

Note that we can test emptiness more effectively when we construct finite state machines
lazily.

We do not include optimizations using the reg functor in this paper but in a separate technical
report [3].

9 Problems of left recursion 

Left recursion handling deserves a topic of its own. Various approaches have been suggested and
various counterexamples found.

In PEG implementing left recursion correctly is an impossible task. Consider the rule:

L = &( L 'cd' ) 'abc'    # a -> abc -> abcbc -> ab
  | &( L 'bcd' ) 'ab'    # ^                     |
  | L 'bc'               # |                     V
  | L 'cb'               # abcbcb <----------- abcb
  | 'a'

on "abcbcbcd".

It creates an infinite cycle in the recursion. This problem is more fundamental as there is a
paradox:

L = ~L

We reject such self-references and raise an error when a lookahead refers to a possibly indirectly
left-recursive rule. Note that in boolean grammars the same problem was recognized [17].

Left recursion can be handled by recursive descent or ascent. A rule:
L = L 'bc' | L 'c' | 'ab' | 'a'
on "abc" is recognized as "(a(bc))" by a recursive descent parser but as "((ab)c)" by a recursive
ascent one. All previous approaches in PEG and context-free bottom-up parsers used the recursive
ascent variant of left recursion. The simplest algorithm is attributed to Paull [1]. It consists
of rewriting direct left recursion
L = L a | b | L c | d
to the equivalent rule
L = (b | d) (a | c)*.
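The rewriting step is mechanical. A minimal Python sketch, assuming each alternative is given as
a tuple of symbol names and using the hypothetical helpers remove_direct_left_recursion and
show:

```python
from typing import List, Tuple

Alternative = Tuple[str, ...]

def remove_direct_left_recursion(name: str, alternatives: List[Alternative]):
    """Split the alternatives of rule `name` into non-recursive bases and the
    tails that follow a leading self-reference: L = (bases) (tails)*."""
    tails = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    bases = [alt for alt in alternatives if not alt or alt[0] != name]
    if any(len(t) == 0 for t in tails):
        raise ValueError("alternative L = L makes no progress")
    if not bases:
        raise ValueError("rule has no non-recursive alternative")
    return bases, tails

def show(bases: List[Alternative], tails: List[Alternative]) -> str:
    fmt = lambda alts: " | ".join(" ".join(a) for a in alts)
    return f"({fmt(bases)}) ({fmt(tails)})*"

if __name__ == "__main__":
    # L = L a | b | L c | d
    bases, tails = remove_direct_left_recursion(
        "L", [("L", "a"), ("b",), ("L", "c"), ("d",)])
    print("L =", show(bases, tails))
```

The relative order of the base alternatives and of the tails is preserved, which matters when
the choice operator is prioritized.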

Indirect left recursion is removed by inlining, thus reducing it to the direct recursion case.

In 1965 Kuno [12] suggested limiting the recursion depth by n. This was rejected in the PEG
setting because in the presence of semantic predicates some recursive rules need more than n
calls. Also it was not clear how to handle infinite streams. But it was rejected prematurely.

Using the reg functor (or simple dataflow analysis) we can compute for each expression a lower
bound on the minimal length of a string that matches that expression. Using this information we
can easily estimate the minimal size of the current continuation. When this bound exceeds the
length of our string we can fail.

For infinite streams we can guess the bound, initially guessing 1 and doubling the bound when
recursion could continue. We do not use this approach as it has exponential complexity in the
worst case.

Note that the same technique can improve Frost's algorithm [5].

In the packrat setting Ford used Paull's algorithm to remove direct left recursion. He declined to
support left recursion, giving the following reason [8]:

"At least until left recursion in TDPL is studied further, utilizing such a feature would
amount to opening a syntactic Pandora's Box, which clearly defeats the pragmatic purpose for
which the simple left recursion transformation is provided."

Warth, Douglass and Millstein [21] attempted to add runtime detection of left recursion. With
a bit of imagination it could be interpreted as running Paull's algorithm at runtime. However,
this approach has several flaws.

One, discovered by Tratt [22], is that seed growing introduces ambiguity in direct left recursion
when a right-recursive alternative is also present.

The revised algorithm of Tratt still contains a flaw. Tratt at certain times forbids expansion
of right recursion.

Tratt's approach fails to handle right-recursive lookahead, as the following counterexample shows.

L = L 'a'
  | ~('b' L) 'b'
  | 'c'

A third issue was discovered by Peter Goodman [10]. Warth's algorithm does not handle the
following grammar.

A = A 'a' / B
B = B 'b' / A / C
C = C 'c' / B / 'd'

Medeiros in an unpublished paper [14] devised a revised version of the seed growing algorithm.

One possible advantage of seed growing could be support for higher-order parametrized
rules. The author's Amethyst parser can in a practical setting resolve higher-order functions,
making this point moot.

9.1 Left recursion in the REG^REG parser

We combine two techniques. First, we rewrite recursion by Paull's algorithm. The second
technique is that continuation passing style does implicit finite state machine minimization.
This is simpler and leads to smaller grammars than Moore's left corner transform heuristic.

We handle left recursion inside iteration by unrolling one level.

With some bookkeeping we can transform left recursion to recursive descent. The idea is that
each alternative returns its derivation and we choose the lexicographically smallest in the
recursion tree. This can be done in O(1) time using a dynamic lowest common ancestor structure [7].

10 Summary 

We introduced the notion of structured grammars, which allow generation of practical linear time
parsers. Our REG^REG class generalizes PEG and is more suitable for various optimizations.
We wrote a C implementation as a proof of concept.

The reg functor gives us a unified framework for many optimizations.

We also integrated left recursion into our parser.

The author also developed a dynamic parser of structured grammars [3]. The dynamic parser
allows one to modify the string and ask for updated parse results. Our dynamic parser recomputes
only the rules that it must recompute. For practical grammars the overhead is within a constant
factor of the time spent on recomputing rules. In the worst case we multiply the time spent on
recomputing rules by a logarithmic factor.

Structured grammars are promising for integration with IDEs. With the information that
structured grammars expose we can automatically offer code folding, error reporting and syntax
highlighting [5].

The author developed structured grammars as part of the Amethyst language. Amethyst generalizes
pattern matching in several directions, as is described in the author's thesis [2].